Biology Methods and Protocols
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match Biology Methods and Protocols's content profile, based on 53 papers previously published here. The average preprint has a 0.08% match score for this journal, so anything above that is already an above-average fit.
Jayme, A.; Heuveline, V.
Background and Objective: Glioblastoma outcome prediction remains difficult because clinically relevant signals are distributed across heterogeneous imaging and genomic modalities, cohorts are small, and conventional neural predictors do not quantify their own uncertainty. This study evaluates a hybrid neural-Bayesian belief network framework for uncertainty-aware multimodal glioblastoma prediction and examines how modality selection, model family, and structure-aware regularization affect predictive performance and confidence quality. Methods: The framework was evaluated on the TCGA-GBM radiogenomic cohort using four input modalities (T1Gd, FLAIR, mRNA, and CNA), five model families, five structural-weight settings, and 15 view subsets. A secondary benchmark on the UCI Human Activity Recognition dataset was included to assess whether observed limitations were specific to the glioblastoma setting. Results: CNA features reduced performance in most multimodal settings, and selective fusion excluding CNA outperformed both the full four-view baseline and imaging-only alternatives. Model families showed clear differences in uncertainty behaviour: non-Bayesian families achieved the strongest predictive accuracy, whereas the Bayesian family achieved the lowest calibration error over a narrower confidence range. Bayesian belief network regularization produced consistent directional improvements without supporting reliable structure-discovery claims, as learned graph structures were not reproducible across folds. On the secondary benchmark, the same framework achieved much higher predictive performance, indicating that the glioblastoma performance ceiling primarily reflects data limitations rather than an architectural constraint. Conclusions: In small-sample radiogenomic prediction, modality choice is at least as important as model choice, and uncertainty quality differs substantially across uncertainty-aware model families. 
The proposed framework provides a practical basis for comparing accuracy, calibration, modality selection, and structure-aware regularization in multimodal biomedical prediction.
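The calibration error compared across model families in this abstract can be illustrated with a binned expected calibration error (ECE). The sketch below is a generic formulation, not the authors' implementation: the abstract does not say which calibration metric was used, and the function name, the ten equal-width bins, and the toy inputs are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between accuracy and confidence.

    confidences: predicted confidence per sample, each in [0, 1]
    correct:     1/True if the prediction was right, else 0/False
    n_bins:      number of equal-width confidence bins (assumed value)
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example (made-up values): four predictions at 0.9 confidence, three correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(toy_ece, 3))
```

In the toy example the model claims 90% confidence but is right 75% of the time, so the gap is roughly 0.15. A lower ECE measured over a narrow confidence range, as reported for the Bayesian family, is not directly comparable to one measured over the full range.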
Brondani, M.; Garbin, J. R.; Soheilipour, S.; Lee, V.
Background: Higher education has been transformed by the rapid integration of generative artificial intelligence (GenAI) tools into academia. The objective of the present study was to examine how and for what purposes senior undergraduate dental students use GenAI tools in academic assignments. Methods: This cross-sectional study uses data from three written assignments submitted by two consecutive cohorts of graduating fourth-year dental students at the Faculty of Dentistry at the University of British Columbia, for a total of 120 students. The assignments focused on different subjects where students had to offer their views, including community water fluoridation. Students were asked to disclose whether and how GenAI tools were used, and for what purpose. Descriptive statistics (e.g., means, frequencies, and proportions) were computed in IBM SPSS Statistics (Version 27.0). Results: Across the two cohorts, 102 students (85%) disclosed the use of GenAI tools in at least one assignment; of these, 69 (67.6%) reported using these tools in all three assignments. ChatGPT was by far the most frequently used GenAI tool, reported by 89 students (87.2%). Nine students (8.8%) did not specify which tool they had used. The majority of the students (91.2%, n = 93) reported using GenAI for proofreading or grammatical editing. About 9.8% of the students (n = 10) reported more substantive uses, such as relying on GenAI to generate part or all of the assignment, and/or to assess the credibility of references. Conclusions: In our study, the use of GenAI tools was highly prevalent among senior undergraduate dental students for editorial purposes. A smaller but notable proportion of students engaged in more substantive uses that may carry academic and ethical risks. 
There is a need for structured AI literacy training and clear, dentistry-specific guidelines to promote responsible and transparent use while safeguarding critical thinking, academic integrity, and professional judgment in dental education.
Khan, D. Z.; Mao, Z.; Hudson, G.; Wijekoon, A.; Chen, J.-e.; Borg, A.; Dorward, N.; Blandford, A.; Clarkson, M.; McCulloch, P.; Bano, S.; Stoyanov, D.; Marcus, H.
Background: Endoscopic pituitary surgery involves navigating high-stakes anatomy where complications, such as carotid artery injury, cause devastating morbidity. While computer vision AI offers potential for real-time anatomical recognition to mitigate these risks, successful translation requires rigorous human-factors and performance evaluation. We present the iterative development and preclinical evaluation of a surgeon-controlled, real-time AI-assisted navigation system. Methods: Guided by IDEAL Stage 0 and DECIDE-AI frameworks, the study was conducted in two phases. Phase 1 was an exploratory study where surgeons used the system during high-fidelity simulated surgery and provided feedback via "Think Aloud" protocols and surveys. Following prototype iteration, a Phase 2 randomized crossover comparative trial was conducted with 19 neurosurgeons (15 trainees, 4 experts) performing high-fidelity simulated tumour resections with and without AI assistance, separated by a minimum 2-week washout. The primary outcome was surgical technical performance (OSATS). Workload, educational value, usability, trust, and implementation outcomes were also assessed. Results: Phase 1 informed hardware, model, and interface refinements, including optimized pedal-controlled overlays and prediction confidence metrics. In the comparative trial, AI assistance significantly improved overall technical performance (OSATS 19.79 ± 4.06 vs. 17.32 ± 4.11; p=0.027). This gain was experience-dependent; AI significantly augmented trainee performance (19.20 ± 3.76 vs. 16.60 ± 3.78), narrowing the proficiency gap, while expert performance remained high and stable. All participants identified the system as a useful training tool. However, subjective workload was significantly higher in the AI arm (SURG-TLX 26.42 ± 9.56 vs. 22.26 ± 7.81; p=0.014). Despite this, usability (SUS 75.13 ± 14.31) and implementation feasibility, acceptability, and appropriateness scores were consistently high (means >4.4/5). 
Conclusions: This study provides a stepwise process for real-time AI development using pituitary surgery as a high-stakes exemplar. The refined surgeon-centric AI system improves training and technical performance, particularly for trainees. Next steps involve first-in-human studies and further exploration of longer-term human factors such as over-reliance, cognitive overload mitigation and trust calibration.
Mendu, M.; Tesh, R. A.; Pellerin, K.; Steward, G. E.; Cerda, I. H.; Williams, M.; Colman, M.; Shah, S.; Lam, A. D.; Cash, S. S.; Westover, M. B.; Kimchi, E. Y.
Delirium, a dynamic neuropsychiatric condition associated with morbidity and mortality, remains underdiagnosed due to reliance on subjective, intermittent screening tools. Objective and potentially continuous identification is needed to improve clinical care. We developed and validated an analytic framework for delirium classification based on automatically extracted video features. In this prospective cohort study, patients (≥18 years) admitted to the inpatient medical or neurological ward of a tertiary academic center between August 2020 and March 2022 with an expected stay longer than one night were enrolled. Daily structured delirium assessments and brief video recordings were performed in consenting patients. Videos were analyzed using deep learning pose estimation to extract keypoints and calculate behavioral features based on eye, face, and limb postures and movements. Four machine learning models (logistic regression, gradient boosting, support vector machines, and random forests) were trained to predict delirium status from extracted features. Model performance was evaluated on 20 repetitions of three-fold cross-validation using the area under the receiver operating characteristic curve (AUC ROC). The cohort included 109 videos from 25 male and 25 female participants (median age: 72, IQR: 63.25-78). Twenty videos (18%) were from patients with delirium. Keypoints for this dataset were more accurately extracted using a customized ResNet-101 model developed with DeepLabCut (sensitivity 0.94, specificity 0.89, compared to human-labeled gold standards) than using off-the-shelf models. Keypoints were then used to generate behavioral features summarizing movement and postures throughout the video. A support vector machine model achieved an average delirium classification AUC ROC of 0.79 (SD ± 0.09), sensitivity of 0.71 (SD ± 0.16), and specificity of 0.78 (SD ± 0.07). 
This study demonstrates the feasibility of identifying delirium using brief videos in clinically heterogeneous cohorts and reveals novel features for objective identification. Author Summary: Delirium is a sudden change in attention and awareness that commonly affects hospitalized patients. It is linked with longer hospital stays, cognitive decline, and death. Patients with delirium often show changes in movements and behaviors such as slowed movement, restlessness, or excessive scanning of the environment. Since current screening tools rely on intermittent human interactions, they can be subjective and miss the fluctuating nature of delirium, leading to underdiagnosis. We sought to explore whether short video recordings could be used to detect delirium automatically. In our study, we enrolled 50 hospitalized patients and conducted daily delirium assessments and video recordings. We used a machine learning model to analyze patients' eye movements, facial expressions, and body postures. We found that video-derived features could be used to identify delirium in a small clinical cohort. Although it needs further validation in outside cohorts, this study provides an important proof of concept for objective delirium monitoring in heterogeneous clinical contexts without adding burden to clinical staff.
Hudson, G. R.; Khan, D. Z.; Fayez, F.; Bhatia, S.; Bano, S.; Costanza, E.; Blandford, A.; Stoyanov, D.; McCulloch, P.; Marcus, H. J.; University College London Collaborators
Background: Endoscopic endonasal transsphenoidal surgery (EETS) requires navigation around neurocritical anatomy. Today, artificial intelligence clinical decision support systems (AI-CDSSs) can orientate surgeons, but clinician trust in AI remains unclear, limiting safe deployment. This study evaluates how modifiable design affects trust and performance in a real-world pituitary surgery AI-CDSS. Method: In an online study, 70 clinicians with pituitary surgery experience were randomised evenly to a Basic or Enhanced AI-CDSS that outlines the sella on EETS operative video. The Enhanced group additionally received an explanation of the model and previous publications, alongside confidence labels depicting outline reliability. Both groups annotated the sella on six video clips, first alone and then with the optional AI-CDSS. Clips were ordered by declining AI performance, except for the final clip. Self-reported trust was measured using a 1-7 scale after each annotation, and performance was the Dice overlap between user annotations and the ground truth. Comparisons used Mann-Whitney U and permutation analysis. Results: Sixty-four participants (91%) finished the exercise (31 Basic, 33 Enhanced). When AI performed best, median trust was 5.00 in both arms (U=559, p=.521). However, when AI performed worst, trust was significantly lower for the Enhanced group (3.00 vs 3.67, U=668, p=.035), a difference sustained in the final clip (3.67 vs 4.33, U=687, p=.019). User performance improved with the AI-CDSS, but with no significant difference between the groups on the best or worst AI performing clips. Nevertheless, for the best AI, senior clinicians had numerically higher median performance in the Enhanced group (0.95 vs 0.90, U=75, p=.066). There was also less dispersion in the Enhanced group when AI was inaccurate (IQR: 0.07 vs 0.21, p=.004). 
Conclusion: Interface design can improve trust calibration in a surgical AI-CDSS and may improve performance among senior clinicians when AI is accurate, and consistency when AI is inaccurate. In future, these features may form important safety checks during translation to the operating room.
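The performance measure in this abstract, the Dice overlap between a user annotation and the ground truth, can be sketched as follows. This is a minimal illustration on binary masks, not the study's code: the function name, the list-of-lists mask format, and the convention for two empty masks are assumptions.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap 2|A∩B| / (|A| + |B|) for two binary masks.

    Masks are equally sized 2D lists of 0/1 values (assumed format).
    """
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x3 masks (made-up values, not study data).
user_annotation = [[0, 1, 1],
                   [0, 1, 0]]
ground_truth = [[0, 1, 1],
                [0, 0, 0]]
print(dice_coefficient(user_annotation, ground_truth))  # 0.8
```

A score of 1.0 means the two outlines coincide exactly and 0.0 means they share no pixels, so the reported medians of 0.95 and 0.90 both indicate close agreement with the ground truth.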
Thabane, A.; McKechnie, T.; Staibano, P.; Scheau, C.; Dragosloveanu, S.; Guerra Farfan, E.; Sajol, R. R.; Arora, V.; Calic, G.; Parpia, S.; Busse, J. W.; Hamoudi, N.; Patel, D.; Reiter-Palmon, R.; Bhandari, M.
Introduction: Creativity is important in surgery for problem-solving in the operating room and the development of surgical innovations that improve patient outcomes. However, our limited understanding of the characteristics and competencies of highly creative surgeons has inhibited our ability to develop the tools, programs and interventions necessary for cultivating the creativity of surgeons. We present the protocol for the INSPIRE Study, which aims to identify the factors associated with high creative achievement in surgeons. Methods and Analysis: We have designed a sequential mixed-method study, including a cohort study accompanied by qualitative semi-structured interviews. The primary objective of this study will be to identify factors associated with high creative achievement in surgeons, to be assessed through direct involvement in innovation or invention, or a top score (10 out of 10) on any domain in the Inventory of Creative Activities and Achievements questionnaire. We plan to measure 39 different personal, domain-specific, domain-general, and environmental/motivational variables, chosen based on previous literature and on exploratory grounds, to be assessed as possible factors of creative potential. Multivariable logistic regression is planned, with high creative achievement as the dependent variable and all 39 potential factors of creative potential as independent variables. Ethics and Dissemination: Ethics approval from the Hamilton Integrated Research Ethics Board has been obtained and no harm is expected due to participation in this study. To facilitate knowledge translation, we plan to publish the feasibility data and results in peer-reviewed journals, and present at international surgical and creativity conferences.
Rajani, M. I.; Yaya, H.; Vandehei, E.; Di Passa, A.-M.; McIntyre-Wood, C.; Prokop-Millar, S.; Krzyzanowski, D.; Zhang, M.; Fein, A.; MacKillop, E.; De Jesus, J.; Frey, B.; MacKillop, J.; Duarte, D.
Background: Mild neurocognitive disorder (NCD) is a condition in which individuals experience mild cognitive decline but are independent in their activities of daily living. Due to the increasing number of people living with mild NCD and its negative impact on quality of life, it poses a significant health burden worldwide. There is therefore an urgent need for innovative approaches to address the lack of effective treatment options. Deep transcranial magnetic stimulation (dTMS), a non-invasive neuromodulation technique approved for the treatment of various neuropsychiatric disorders, could serve as a novel intervention for mild NCD. It can stimulate deeper and broader areas of the brain implicated in mild NCD, such as the prefrontal cortex, insula, and anterior cingulate cortex. Objectives: This study will examine the feasibility and tolerability of the Health Canada and Food and Drug Administration (FDA) approved dTMS coils (H1, H4 and H7 coils) in individuals with mild NCD. Secondarily, it will assess the impact of dTMS on cognition, mood, sleep, anxiety, brain activity (via electroencephalography), and blood biomarkers of neurodegeneration and inflammation. Methods: This open-label pilot study will recruit a total of N=30 participants between the ages of 60-90 with mild NCD. Participants will be assigned to one of the three dTMS coil conditions (H1, H4 & H7) and will complete a total of 20 dTMS sessions over 6 weeks. Data will be collected before, during, immediately after, and one month after the intervention period. Discussion: This pilot study will generate necessary evidence regarding the feasibility and tolerability of dTMS in mild NCD. This evidence will be used to determine whether a definitive trial is justified and to inform the trial procedures. In the long term, dTMS may address a critical gap in therapeutic options for mild NCD. Clinical Trial registration: The protocol was registered on ClinicalTrials.gov (CT07038798) on June 2nd, 2025.
Khan, D. Z.; Adams, T.; Wijekoon, A.; Ramirez Herrera, R.; Bano, S.; McCulloch, P.; Stoyanov, D.; Clarkson, M. J.; Costanza, E.; Blandford, A.; Marcus, H.; CARES Evaluation Group
Artificial intelligence (AI) decision support systems for surgery hold promise but face barriers to adoption, particularly around the interpretability of their outputs. We conducted an international cross-sectional survey of 47 neurosurgeons to evaluate perspectives on literature-derived explanation techniques for AI-generated anatomical segmentations, using endoscopic pituitary surgery as a high-risk exemplar. Participants ranked certainty scores, certainty maps, saliency maps, scene similarity scores, and nearest-neighbour illustrations, and rated them using a modified Explanation Satisfaction Scale alongside free-text feedback. Certainty-based techniques were consistently ranked and rated highest for interpretability - valued for aligning with surgical decision-making by conveying confidence (via scores) and anatomical boundaries (via maps). Saliency- and similarity-based methods were judged less clinically relevant and better suited to educational settings. Certainty-based explanations, therefore, appear most acceptable to surgeons for clinical integration of decision support systems, though their impact on AI acceptability, trust calibration, and performance requires prospective evaluation across surgical domains.
Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.
Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean area under the receiver operating characteristic curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem. Author Summary: Protein-protein interactions are fundamental in all biological processes. 
However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned features of the primary sequence, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of frequency-only features with hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
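The "hybrid features" described in this abstract can be sketched for a single protein pair as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the latent vectors stand in for the output of the autoencoder (which is not reproduced here), and simple concatenation of the two proteins' features is an assumed way of combining them.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence):
    """Relative frequency of each standard amino acid in a sequence.

    Non-standard letters (e.g. X, B, Z) are ignored in this sketch.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def hybrid_features(seq_a, seq_b, latent_a, latent_b):
    """Concatenate frequency features and latent features for a protein pair.

    latent_a / latent_b are placeholders for autoencoder embeddings.
    """
    return (aa_frequencies(seq_a) + aa_frequencies(seq_b)
            + list(latent_a) + list(latent_b))


# Toy sequences and made-up two-dimensional latent vectors.
features = hybrid_features("ACCA", "MKV", [0.1, -0.3], [0.7, 0.2])
print(len(features))  # 44
```

The resulting vector would then be passed to a classifier such as the random forest named in the abstract. A frequency-only baseline simply omits the latent part.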
Zilyte, A.; Petrikaite, V.
In this study, we evaluated the impact of different in vitro 3D culture modelling methods on the activity of doxorubicin (DOX) and 5-fluorouracil (5-FU) in human melanoma spheroids. Human melanoma A375 and IGR39 spheroids were generated using the hanging drop and non-adhesive surface methods. Spheroid growth dynamics were assessed by measuring changes in spheroid diameter. To compare the effects of anticancer drugs in spheroids of different sizes, spheroids of approximately 200 and 400 µm were formed. Drug activity was evaluated based on spheroid growth and cell viability using the MTT assay. A375 spheroids formed using the non-adhesive surface method were more sensitive to DOX than spheroids formed using the hanging drop method. In smaller A375 spheroids, 10 µM 5-FU reduced cell viability more effectively in spheroids formed using the hanging drop method. In contrast, IGR39 spheroids formed by the hanging drop method were more resistant than those formed on a non-adhesive surface. However, in IGR39 spheroids, the effects of DOX and 5-FU on growth and viability did not significantly differ between formation methods. In conclusion, A375 spheroid growth was not significantly influenced by the formation method, whereas IGR39 spheroid growth depended on the method used. A375 spheroids formed on non-adhesive surfaces were more sensitive to DOX, whereas 5-FU activity depended on drug concentration and spheroid size. In IGR39 spheroids, the effects of DOX and 5-FU on growth and viability were largely independent of the spheroid formation method. Based on these results, researchers should carefully select the spheroid formation method for their studies, as it may influence the measured effects of tested compounds on spheroid size and viability.
Sajjad, M.
Artificial intelligence (AI) tools have been rapidly adopted by medical researchers, yet whether early-career researchers in low- and middle-income countries possess the awareness and habits needed to use these tools safely remains poorly documented. This study characterized AI adoption patterns, hallucination awareness, and verification and disclosure practices among early-career medical researchers in Pakistan. A cross-sectional anonymous online survey was conducted among medical students, house officers, residents, physicians, and faculty involved in research or academic work across Pakistan (May 2026). Descriptive statistics and chi-square tests were applied to 373 eligible responses. AI use was near universal (99.7%), with 60.3% using AI tools daily. The most commonly reported tool in this sample was Claude (40.5%), followed by ChatGPT (29.2%) and Perplexity (26.0%), though this ranking likely reflects sampling characteristics. Despite high adoption, 59.2% typically did not verify AI outputs before use, and 40.2% had never heard that AI can generate fabricated scientific references. In behavioral vignettes, 36.5% assumed convincing AI-generated references were authentic, and 54.2% would continue using remaining AI content after discovering one fabricated reference. Formal research training was strongly associated with consistent disclosure (51.7% vs. 17.1%; chi-square = 48.43, p < 0.001). Role, daily use frequency, and research training were not significantly associated with verification behavior. Early-career medical researchers in Pakistan demonstrate high AI adoption alongside incomplete hallucination awareness and infrequent verification, a pattern that may carry implications for research integrity. Formal training was the only factor significantly associated with consistent disclosure. Integration of AI literacy into medical curricula and institutional governance frameworks merits consideration.
Van De Vijver, E.; Decroix, K.; Burggraeve, D.; Van Wassenhove, P.; De Vos, Z.; Ampe, C.; Devisscher, L.; Van Vlierberghe, H.; Van Troys, M.
Background and aims: Therapeutic outcomes for advanced hepatocellular carcinoma (HCC) remain inadequate, despite recent advances using immunotherapy. Long-term effectiveness of systemic therapies, including second-line multi-tyrosine kinase inhibitor sorafenib, is limited by resistance mechanisms and adverse effects. Upregulated deubiquitinase UCH-L1 is frequently correlated with poor prognosis in cancers. Here, we investigated the therapeutic potential of combining pharmacological UCH-L1 inhibition with sorafenib in HCC. Methods: UCH-L1 expression was analysed in TCGA-LIHC data and patient-derived HCC tissues. Sorafenib and LDN57444 effects were evaluated in vitro in cytotoxicity and invasion assays. Gene and protein expression were examined by RT-qPCR, Western blotting and immunohistochemistry. In vivo efficacy of the drug combination was assessed in an orthotopic xenograft mouse HCC model. Results: In silico data analysis revealed significantly higher UCH-L1 levels in patient HCC tumours versus non-tumour tissue, associated with reduced overall survival. Low-dose sorafenib upregulated UCH-L1 in the HCC cell line Hep3B. Paradoxically, this also promoted invasiveness and sustained MEK1/2-ERK1/2 pathway activation. Combining low-dose sorafenib with LDN57444 produced strong synergistic cytotoxicity in vitro, reverted MAPK activation and suppressed invasion. Consistently, at a low sorafenib dose, co-treatment with LDN57444 completely inhibited tumour growth of Hep3B xenografts and enhanced sorafenib efficacy. Conclusion: LDN57444 sensitises HCC cells to low-dose sorafenib by reverting drug-induced pro-oncogenic signalling and thereby strongly synergises with sorafenib to enhance anti-tumour efficacy in an HCC mouse model. This presents UCH-L1 as a player in treatment-induced adaptive response and supports further exploration of UCH-L1 targeting in combination with sorafenib as a therapeutic avenue for advanced HCC. 
Lay summary: This study explores a new treatment approach for hepatocellular carcinoma (HCC) by combining two drugs: LDN57444, which blocks the enzyme UCH-L1, and sorafenib, an FDA-approved multi-tyrosine kinase inhibitor. We evaluated the effect of this drug combination in vitro using an HCC cell line and in a mouse HCC model. The drug combination displayed strong synergy in lowering HCC cell viability, and greatly reduced invasiveness and in vivo tumour growth. LDN57444 sensitised HCC cells to low doses of sorafenib by preventing UCH-L1-mediated activation of pro-oncogenic signalling. These findings highlight the potential of this new drug combination for treating advanced HCC, thereby potentially reducing side effects and countering drug resistance. Impact and implications: Our preclinical research introduces a novel combination strategy against advanced HCC that holds potential to improve existing therapies, particularly the second-line multi-tyrosine kinase inhibitor sorafenib. The proposed combination of sorafenib with an inhibitor of the deubiquitinase UCH-L1 not only enhances sorafenib efficacy but also shows promise for countering resistance mechanisms. Moreover, because effective responses are achieved at lower drug doses, this may also reduce therapy-associated adverse effects, further increasing potential impact. While sorafenib is FDA-approved, the UCH-L1 inhibitor LDN57444 needs further (clinical) development to bring our promising findings to full translational potential for HCC patients and physicians.
Uegami, W.; Bisson, T.; Okoshi, E. N.; Costa da Silva, F. G.; Jiragawasan, C.; Zerbe, N.; Bychkov, A.; Fukuoka, J.
Interobserver variability in pathological assessments is a well-recognized challenge that impacts diagnostic reliability and disease understanding. This variability exists across many subspecialties due to the subjective nature of evaluations. Artificial intelligence (AI) applied to whole slide images has potential to standardize procedures and reduce variability in pathology, but transitioning to these technologies does not guarantee improvement. Establishing reliable ground truth datasets with consensus annotations is crucial for developing robust AI solutions. We introduce SortIT, an open-source web application that facilitates systematic creation and evaluation of ground truth image tile annotations. SortIT enables multiple annotators to independently label tiles, with flexible user permission controls. Annotated data can be exported for statistical analysis of observer variation and for creating ground truth datasets from consensus tiles. We outline protocols using SortIT for several use cases: (1) mitosis segmentation in tumor regions, (2) evaluating AI solutions for prostate cancer grading by comparing to expert consensus, and (3) granuloma classification by annotating discriminative tile-level features. A key strength of SortIT lies in its ease of deployment, making it accessible and usable for a wide range of users. Overall, SortIT provides a valuable tool to establish high-quality ground truth datasets and comprehensively assess observer variability. Critical evaluation of ground truth quality using systematic annotation methodologies is crucial for developing accurate and generalizable diagnostic AI tools. SortIT's open-source nature facilitates community adoption and further development.
Alickovic, F.; Lenz, S.; Ustjanzew, A.; Ortiz Rosario, L.; Vollmar, G. M.; Kindler, T.; Panholzer, T.
Introduction: Coding tumor diagnoses from free-text clinical documentation currently requires substantial manual effort. Promising approaches for automating this process include large language models (LLMs), embedding models, and retrieval-augmented generation (RAG). While previous studies often focus on a single method, we directly compare these approaches on a real-world dataset of tumor diagnosis descriptions to assess their strengths and limitations. Methods: We evaluated nine different embedding models using similarity search and embedding-based classification, as well as LLM-based coding, with and without RAG, on a real-world dataset of 2,024 unique German tumor diagnosis descriptions labeled with ICD-10 and ICD-O topography codes. The retrieval knowledge base was constructed exclusively from standardized Alpha-ID, ICD-10-GM, and ICD-O-3 classifications. Performance was assessed for exact (full-code) and partial (three-character) code prediction. For RAG, we evaluated base and fine-tuned versions of Llama 3.1 8B and Llama 3.3 70B. Results: Qwen3-Embedding-8B, the largest embedding model, yielded the best results. It achieved 47.8% exact-match and 72.1% partial-match accuracy for ICD-10 coding with classification, and 42.7% exact-match and 73.5% partial-match accuracy for ICD-O coding with similarity search. The other embedding models, including medically specialized ones, showed varied but lower performance. RAG improved base LLM performance and outperformed embedding-based approaches on partial-match accuracy (80.6% partial-match accuracy for ICD-10 and 75.0% for ICD-O with Llama 3.3 70B), but not on exact-match accuracy. Conclusion: A direct comparison with embedding-based approaches is essential to determine whether the additional effort of RAG is justified. The strong variation in performance also highlights the importance of model selection. 
Further advances in embedding-based methods, potentially supported by larger and more diverse training data, may offer a promising direction for future work.
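The distinction between exact (full-code) and partial (three-character) matching in the abstract above can be illustrated with a minimal sketch. The function name `match_accuracy` and the toy codes are hypothetical and are not taken from the study; the sketch only assumes that a partial match means agreement on the first three characters of the code.

```python
def match_accuracy(predicted, gold, prefix_len=3):
    """Return (exact, partial) accuracy for two equal-length lists of codes.

    A prediction is an exact match if the full code agrees, and a partial
    match if the first `prefix_len` characters agree (assumed definition).
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted, gold))
    exact = sum(p == g for p, g in pairs)
    partial = sum(p[:prefix_len] == g[:prefix_len] for p, g in pairs)
    return exact / len(gold), partial / len(gold)


# Hypothetical toy codes for illustration only.
gold = ["C50.4", "C34.1", "C18.7", "C61"]
pred = ["C50.9", "C34.1", "C20", "C61"]
exact_acc, partial_acc = match_accuracy(pred, gold)
print(exact_acc, partial_acc)
```

Every exact match is also a partial match, so partial-match accuracy can never fall below exact-match accuracy under this definition.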
Ge, Z.; Liu, S.; Dou, W.
Background and Objective: Normative modeling is a key tool for understanding brain alterations in neurodegenerative diseases, such as cerebellar-type multiple system atrophy. However, existing methods lack interpretability and fail to capture clinically meaningful pathological changes. This study presents DINMC, a Deep Interpretable Normative Model Construction framework, which combines autoencoder-based learning with statistical hypothesis testing to better capture and interpret disease-specific neuroanatomical changes. Methods: The DINMC framework constructs normative models using neuroimaging data from large multi-site healthy cohorts. It utilizes a U-shaped convolutional autoencoder to train these models, which are then applied to reconstruct brain features from both patients and healthy controls within the same study cohort. Pathological confidence values are derived by fusing original and deviation feature spaces, offering a measure of disease-related pathology reflected in each dimension of the features. The framework was validated through statistical analysis and prognostic classification and regression tasks. Results: The pathological confidence provides valuable insights into the neuroanatomical regions most affected by the disease, as well as the correlation between changes in these regions and clinical assessment scales. Our optimal model outperforms traditional methods in prognostic prediction tasks, with an AUC of 0.972 for classification tasks and an R² of 0.432 for regression tasks. Conclusion: DINMC provides a novel and interpretable framework for neuroimaging analysis. By combining deep learning and statistical hypothesis testing, this framework offers a unique solution to improving both the interpretability and performance of normative models in neuroimaging. The approach is scalable to other neuroimaging datasets, offering a versatile tool for broader biomedical applications.
Fang, H.; Tan, T.
Background: The development of personalised mRNA cancer vaccines holds considerable promise for oncology, yet a significant translational gap persists between neoantigen identification and the selection of therapeutically impactful targets. Current approaches predominantly prioritise human leukocyte antigen (HLA) binding affinity and immunogenicity, often overlooking the systems-level biological context of the target. This can inadvertently favour immunogenic but biologically peripheral peptides that exert limited influence on tumour signalling networks, thereby constraining vaccine efficacy. Furthermore, mRNA therapeutics must satisfy additional design requirements, including favourable codon usage and favourable secondary-structure stability, which directly affect in vivo translation and half-life. A unified computational framework that integrates neoantigen discovery with network biology is therefore critically needed. Results: Here, we present PimRNA, a Priority index (Pi)-centric computational medicine framework that bridges this gap by unifying neoantigen identification, mRNA sequence optimisation, and gene interaction network analysis. First, high-confidence tumour-specific HLA class I and II neoantigenic peptides are identified from paired tumour-normal genomic and tumour transcriptomic data using NeoDisc. Second, the coding sequences of these peptides are optimised for stability and translational efficiency with LinearDesign, yielding a core set of neoantigen-encoding mRNAs. Third, a random walk with restart algorithm is applied to a knowledgebase of gene interactions to identify peripheral genes exhibiting significant network connectivity to core genes, generating a gene-predictor matrix in which each gene is assigned an affinity score reflecting its network proximity to immunogenic neoantigens. 
These scores are consolidated into a single, unified priority rating (0-5) for each gene, followed by subnetwork analysis that reveals therapeutically relevant gene modules. Application of PimRNA to breast cancer and melanoma datasets demonstrates that it successfully selects high-confidence immunogenic neoantigen candidates embedded within biologically meaningful tumour-specific networks. Conclusion: PimRNA provides a systems biology foundation for mRNA vaccine design, moving beyond isolated immunogenicity to prioritise targets that are both highly presented and central to tumour-relevant biological networks. This framework offers a generalisable strategy for the rational discovery and prioritisation of mRNA therapeutics, significantly advancing the field of computational medicine towards personalised cancer vaccines.
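The random walk with restart step mentioned in the abstract above can be sketched as follows. This is a generic illustration of the technique, not the PimRNA implementation: the graph, the node names, the restart probability of 0.5, and the function name are all hypothetical, and the conversion of scores into a 0-5 priority rating is not reproduced.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    `adjacency` maps each node to a list of its neighbours (undirected graph,
    so every edge is listed in both directions). `seeds` are the core nodes
    the walker restarts from. Returns a dict of steady-state affinity scores.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            nbrs = adjacency[u]
            if not nbrs:
                # Isolated node: keep its walking mass in place.
                nxt[u] += (1 - restart) * p[u]
                continue
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical path graph A - B - C - D with a single seed ("core") node A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(graph, seeds={"A"})
print(scores)
```

In this toy graph the score decreases with distance from the seed, which is the sense in which the affinity score reflects network proximity.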
Dahlberg, A. C. H.; Tapiola, O.; Luisto, R.; Puranen, T.; Sanmark, E.; Vartiainen, V.
Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records. Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records. Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding. 
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
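The three vector comparisons named in the abstract above (cosine similarity, L1 distance, and Euclidean distance) can be written out in a few lines. The vectors below are hypothetical three-dimensional stand-ins for real embeddings, chosen only so that the "deletion" vector sits further from the original than the "paraphrase" vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l1_distance(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy "embeddings" for illustration only.
original = [3.0, 4.0, 0.0]
paraphrase = [3.0, 4.0, 1.0]
deletion = [0.0, 4.0, 3.0]

for name, vec in [("paraphrase", paraphrase), ("deletion", deletion)]:
    print(name,
          cosine_similarity(original, vec),
          l1_distance(original, vec),
          euclidean_distance(original, vec))
```

A larger semantic change should lower the cosine similarity and raise both distances; comparing only a leading slice of each vector (for example `original[:k]`) would be one simple way to probe the partial-vector behaviour the abstract mentions, although the study's exact procedure is not specified here.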
Bertin, D.; Bongrand, P.; Bardin, N.
In view of the outstanding progress of machine learning (ML) and growing cost of health systems, it is a current challenge to incorporate artificial intelligence tools into actual medical practice. Here we explored the feasibility and reliability of using machine learning to perform an important immunological investigation that currently requires experienced biologists: anti-neutrophil cytoplasmic antibodies (ANCAs) are important markers for vasculitis, and they may be evidenced by microscopic examination of cells labeled with patients' sera. The use of a reliable ML classifier to discriminate between positive and negative samples would increase the rapidity and decrease the cost of immunofluorescence-based ANCA detection. Here, we tested seven well-documented ML algorithms, ranging from simple models such as k nearest neighbors to more complex convolutional neural networks involving millions of adjustable parameters. We studied the feasibility and reliability of classifying 1114 serum samples that had been collected for about 3 years and assayed with the conventional procedure. We compared four strategies consisting of assaying either whole microscope fields or individual cell images, and natural images or histograms. The following conclusions were obtained: (i) Several different strategies allowed us to build models stable enough to discriminate between positive and negative samples collected during about 27 months, with a comparison to human classification yielding a kappa index of about 0.7, which may be considered fairly good and intermediate between the performance of junior and senior biologists. (ii) Simpler ML models combined with theoretical thinking might provide the most rapid and efficient way of developing a reliable test within the framework of a single institution. (iii) In addition, the interpretability of the simplest model provided some theoretical insight into important classification parameters.
(iv) An important caveat is that the multiplicity and versatility of currently available tools make it essential to repeatedly test a given model, which should be chosen to be as simple as possible, to achieve a reliability compatible with medical use. It is concluded that our study provides a strong incentive to incorporate ML tools in well-defined medical tests, which might reduce the risk of human errors and pave the way to fully automatic procedures.
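The kappa index reported in the abstract above measures agreement beyond chance between two raters. The following sketch assumes it refers to Cohen's kappa, which the abstract does not state explicitly; the labels are hypothetical and do not come from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels: 1 = positive sample, 0 = negative sample.
human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
model = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(human, model)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 samples, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.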
Zhang, F. y.; Yao, J.; Zhou, Q. y.; Fang, Y. c.; Hu, A.; Wang, Y.; Ding, W.; Wu, X.; Gu, Y.
Robot-assisted hematoma puncture has seen significant development in primary hospitals across the country. The Sino Plan software system, independently developed by Sinovation, is the core of the intelligent surgical robot. We conducted a comparative study of imaging indicators, such as residual hematoma volume and hematoma clearance rate, as well as prognostic indicators, in patients who underwent hematoma puncture at our hospital over a 9-year period, before and after the introduction of Sino Plan. The results indicated that following the application of Sino Plan, the hematoma clearance rate was significantly enhanced, and the residual hematoma volume was markedly reduced. Regarding patient prognosis, there was no significant difference in GCS scores between the two groups, but the incidence of adverse prognostic events was lower in patients where Sino Plan was utilized. In conclusion, this 9-year retrospective analysis at our hospital reveals that Sino Plan offers distinct advantages. However, its application in certain special cases suggests that further improvements to the software are warranted to better meet the demands of more specific clinical scenarios.
TSUKADA, Y. T.; Hirayama, H.; Yodogawa, K.; Murata, H.; Iwasaki, Y.-k.; Fujino, T.; Shiozawa, A.; Tsukada, S.
Deep-learning ECG analysis is advancing rapidly but lacks stable, physiologically interpretable indicators to anchor explainable artificial intelligence (AI). Tensor cardiography (TCG) models electrocardiographic (ECG) waveforms as differences between pairs of cumulative distribution functions (CDFs), representing collective myocardial action potential transitions. However, the original 4-CDF model has limitations in fitting P waves and complex QRST patterns. This study aimed to evaluate whether increasing the number of CDFs from 4 to 10 improves TCG fitting accuracy and to characterize normative distributions of 10-CDF parameters in healthy individuals. Participants were recruited through occupational health screening at Tobu Railway Co., Ltd. (n = 415) and from the Nippon Medical School Hospital ECG database (n = 29). Standard 12-lead ECGs from 444 healthy participants, including 345 men and 99 women with a mean age of 46.9 years, were analyzed using TCG software. Reconstruction accuracy was assessed using RMSE, paired t-tests, and Cohen's d. The 10-CDF model achieved significantly lower RMSE values across all leads than the 4-CDF model, with all p values < 0.0001 and very large effect sizes. In representative leads, RMSEs for the 4-CDF versus 10-CDF models were 0.0256 versus 0.0061 in lead II, 0.0230 versus 0.0063 in lead V1, and 0.0265 versus 0.0062 in lead V5. The coefficient of determination improved from a median of 0.952 with the 4-CDF model to 0.997 with the 10-CDF model in lead II. Parameter dispersion was reduced, suggesting improved estimation stability. Two new parameters, T_mean_diff and RT_mean_duration, were derivable from the expanded model; RT_mean_duration showed significant correlations with age and body surface area. In conclusion, increasing the CDF resolution from 4 to 10 significantly enhanced ECG waveform reconstruction accuracy and parameter stability.
These findings provide normative distributions of 10-CDF TCG parameters and may support future explainable AI-based ECG analysis.
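The reconstruction metrics named in the abstract above, RMSE and Cohen's d, can be sketched in a few lines. The abstract does not say which variant of Cohen's d was used for the paired comparison; this sketch assumes the paired-differences form (mean difference divided by the standard deviation of the differences). All numbers are hypothetical and are not the study's data.

```python
import math
import statistics


def rmse(observed, fitted):
    """Root-mean-square error between a waveform and its reconstruction."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - f) ** 2 for o, f in zip(observed, fitted)]
    return math.sqrt(sum(squared) / len(squared))


def paired_cohens_d(x, y):
    """Effect size for paired samples: mean(x - y) / sd(x - y) (assumed variant)."""
    diffs = [a - b for a, b in zip(x, y)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


# Hypothetical waveform samples and a coarse reconstruction of them.
observed = [0.0, 1.0, 0.0, -1.0]
fitted = [0.0, 0.5, 0.0, -0.5]
error = rmse(observed, fitted)

# Hypothetical per-participant RMSEs for a coarser and a finer model.
rmse_coarse = [0.030, 0.025, 0.028]
rmse_fine = [0.006, 0.007, 0.005]
effect = paired_cohens_d(rmse_coarse, rmse_fine)
print(error, effect)
```

A positive effect size here means the coarser model has the larger error for the same participants, which is the direction of the comparison reported in the abstract.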